Introduction

Airport delays are among the most common problems people face when travelling. These delays are often associated with certain carriers or destinations. The question, then, is whether carriers and destinations really cause these delays or whether other factors are responsible.

To answer this question, data were collected on 336,776 flights that departed in 2013 from New York City’s three airports: John F. Kennedy International Airport (JFK), Newark Liberty International Airport (EWR) and LaGuardia Airport (LGA). The destinations are spread all over the United States and some of its territories (Puerto Rico and the U.S. Virgin Islands). The data are described by the 19 attributes listed in Table 1, and a sample of the data is shown in Table 2.


Goal

The aim of this analysis is to determine whether there is a relationship between the delays of flights departing New York City and the attributes of the dataset. We hypothesize that flight distance and destination are major contributors to delays, and that delays will therefore differ between destinations at different distances.


Methods

  • Data Visualization
  • Exploratory Data Analysis (EDA)
  • Data Munging


The Dataset

Attribute       Description
year            Date of departure.
month           Date of departure.
day             Date of departure.
dep_time        Actual departure time (format HHMM or HMM), local tz.
arr_time        Actual arrival time (format HHMM or HMM), local tz.
sched_dep_time  Scheduled departure time (format HHMM or HMM), local tz.
sched_arr_time  Scheduled arrival time (format HHMM or HMM), local tz.
dep_delay       Departure delay, in minutes. Negative times represent early departures.
arr_delay       Arrival delay, in minutes. Negative times represent early arrivals.
carrier         Two letter carrier abbreviation.
flight          Flight number.
tailnum         Plane tail number.
origin          Origin airport.
dest            Destination airport.
air_time        Amount of time spent in the air, in minutes.
distance        Distance between airports, in miles.
hour            Hour component of the scheduled departure time.
minute          Minute component of the scheduled departure time.
time_hour       Scheduled date and hour of the flight as a POSIXct date.

Table 1: A list of all attributes in the dataset and their descriptions
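
The HHMM/HMM encoding in Table 1 means that clock times cannot be subtracted directly (600 minus 554 is not 46 minutes). The following is a minimal Python sketch of the conversion, not code from this report; the function name hhmm_to_minutes is our own, and the example values are taken from the sample rows of the dataset.

```python
def hhmm_to_minutes(hhmm: int) -> int:
    """Convert a clock time stored as HHMM or HMM (e.g. 517) to minutes since midnight."""
    hours, minutes = divmod(hhmm, 100)
    return hours * 60 + minutes


# Sample flight: departed at 554, scheduled for 600.
actual = hhmm_to_minutes(554)
scheduled = hhmm_to_minutes(600)

# Difference in minutes; negative means the flight left early.
# This simple subtraction does not handle flights that cross midnight.
delay = actual - scheduled
print(delay)
```

For this sample flight the result agrees with its recorded dep_delay of -6 minutes.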


Table 2: A sample of the dataset and how it is structured



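Now let’s take a look at our data. The output below shows the number of missing values per attribute, the structure of the data, and summary statistics. Note that it reports 327,346 rows and no missing values, which suggests that incomplete rows had already been removed at this point.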

##           year          month            day       dep_time sched_dep_time 
##              0              0              0              0              0 
##      dep_delay       arr_time sched_arr_time      arr_delay        carrier 
##              0              0              0              0              0 
##         flight        tailnum         origin           dest       air_time 
##              0              0              0              0              0 
##       distance           hour         minute      time_hour 
##              0              0              0              0
## Rows: 327,346
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 500  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 905  
##  Median :2013   Median : 7.000   Median :16.00   Median :1400   Median :1355  
##  Mean   :2013   Mean   : 6.565   Mean   :15.74   Mean   :1349   Mean   :1340  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1122   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1554   Median :  -5.000  
##  Mean   :  12.56   Mean   :1502   Mean   :1533   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1944   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##    carrier              flight       tailnum             origin         
##  Length:327346      Min.   :   1   Length:327346      Length:327346     
##  Class :character   1st Qu.: 544   Class :character   Class :character  
##  Mode  :character   Median :1467   Mode  :character   Mode  :character  
##                     Mean   :1943                                        
##                     3rd Qu.:3412                                        
##                     Max.   :8500                                        
##      dest              air_time        distance         hour      
##  Length:327346      Min.   : 20.0   Min.   :  80   Min.   : 5.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 509   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 888   Median :13.00  
##                     Mean   :150.7   Mean   :1048   Mean   :13.14  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##      minute        time_hour                     
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00.00  
##  1st Qu.: 8.00   1st Qu.:2013-04-05 06:00:00.00  
##  Median :29.00   Median :2013-07-04 09:00:00.00  
##  Mean   :26.23   Mean   :2013-07-03 17:56:45.44  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 18:00:00.00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00.00
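
The report does not show how missing values were handled, but the drop from 336,776 flights to 327,346 rows (9,430 fewer) is consistent with removing every row that has at least one missing value. Below is a minimal Python sketch of that kind of clean-up, under that assumption. The rows are made up for illustration, and the helper names count_missing and drop_incomplete are our own.

```python
# Hypothetical rows for illustration; None marks a missing value (NA in R).
rows = [
    {"dep_time": 517, "dep_delay": 2, "arr_delay": 11},
    {"dep_time": None, "dep_delay": None, "arr_delay": None},
    {"dep_time": 544, "dep_delay": -1, "arr_delay": None},
]


def count_missing(rows):
    """Return the number of missing values in each column."""
    columns = rows[0].keys()
    return {col: sum(row[col] is None for row in rows) for col in columns}


def drop_incomplete(rows):
    """Keep only the rows in which every value is present."""
    return [row for row in rows if all(value is not None for value in row.values())]


missing_before = count_missing(rows)
complete_rows = drop_incomplete(rows)
missing_after = count_missing(complete_rows)

print(missing_before)
print(len(complete_rows), missing_after)
```

After the clean-up every column reports zero missing values, which is the pattern seen at the top of the output above.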



EDA

Three airports for a single city may seem like a lot, more than many of the world’s capitals have. Yet the city that never sleeps hosts only 3 of the 16 airports in the state of New York, and New York is not even the state with the most airports, as shown in Figure 1.


Figure 1: An interactive map showing the number of airports per state. Puerto Rico (7 airports) and U.S. Virgin Islands (2 airports) are not shown.

Based on the figure we can observe that:


The three airports of New York City serve different purposes (for example, domestic or international traffic) and hence should handle different numbers of flights. Let us confirm that using Figure 2.

Figure 2: Number of flights per NYC airport

From the figure above, we can see that all three airports handle a broadly similar number of flights. LaGuardia Airport (LGA) has the fewest, probably due to the fact that it is primarily a domestic airport. Newark Liberty International Airport (EWR) has the largest share of flights, which may be explained by its location: it lies in New Jersey, just across the state line from New York City, which may make it a more convenient choice than JFK International Airport for some travellers.


Despite the large number of flights and airports around the United States, Figure 3 shows that some states were never reached from NYC airports during 2013.


Figure 3: Map showing flights per state as destination

The map shows us that the most visited destination is Florida, followed by California with almost half as many flights. The map also shows that there are 8 states with zero flights towards them, namely: Mississippi, Kansas, Idaho, New Hampshire, New Jersey, Delaware, South Dakota and North Dakota.


Figure 4 shows the delays per airport:

Figure 4: Number of delays per NYC airport

The figure features both positive and negative points, indicating that a departure or arrival was behind or ahead of its scheduled time, respectively. Yet this alone is not enough to say that any one airport has a characteristic delay time.

To investigate further, let us take a look at the attributes and how they relate to each other.

Figure 5: Correlation matrix of the dataset attributes

Here we see the correlation matrix between the main numeric attributes.
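
Each cell of such a matrix is a correlation coefficient between two attributes. The report does not state which coefficient was used, so the following is a minimal Python sketch that assumes Pearson correlation. The function name pearson is our own, and the values are the first eight distance and air_time entries from the sample output of the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_x = sum((x - mean_x) ** 2 for x in xs)
    sq_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(sq_x * sq_y)


# First eight sample flights: distance in miles, air_time in minutes.
distance = [1400, 1416, 1089, 1576, 762, 719, 1065, 229]
air_time = [227, 227, 160, 183, 116, 150, 158, 53]

r = pearson(distance, air_time)
print(round(r, 2))
```

On these few rows the coefficient is strongly positive, as one would expect for distance and time in the air. Eight rows are far too few to draw conclusions, so this only illustrates the calculation.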


Although the correlation matrix shows the relationships between many variables, it does not cover a very important aspect of flights: time.

Figure 6: Departure delays per month


Figure 7: Arrival delays per month

We can see from the first plot (Figure 6) that the highest numbers of departure delays occur in months 6 (Jun), 7 (Jul) and 12 (Dec), which indicates that there are more departure delays during the summer and winter breaks.


Similarly, in the second plot (Figure 7), we see that the highest numbers of arrival delays occur in months 7 (Jul) and 12 (Dec), during the summer and winter breaks.


Figure 8: Number of flights and their departure delays per month


Figure 9: Number of flights and their arrival delays per month


In the first plot, we can see that as the number of flights increases, the number of departure delays also increases. The months with the most delays, 6 (Jun), 7 (Jul) and 12 (Dec), also have the highest number of flights compared to other months.


The second plot is similar to the first: as the number of flights increases, the number of arrival delays also increases. The exception is month 8 (Aug), where the number of flights was high but the number of arrival delays was lower than in some months with fewer flights.
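
The monthly comparison rests on two counts per month: the number of flights and the number of those flights that were delayed. Below is a minimal Python sketch of that tally, assuming that a flight counts as delayed when its delay is greater than zero. The (month, delay) pairs are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical (month, dep_delay in minutes) pairs, made up for illustration.
flights = [(6, 12), (6, -3), (7, 45), (7, 8), (7, -1), (12, 30), (12, 0), (1, -5)]

# Total flights per month, and flights with a positive delay per month.
flights_per_month = Counter(month for month, _ in flights)
delays_per_month = Counter(month for month, delay in flights if delay > 0)

for month in sorted(flights_per_month):
    total = flights_per_month[month]
    delayed = delays_per_month[month]  # a Counter returns 0 for absent months
    print(f"month {month:2d}: {total} flights, {delayed} delayed")
```

Dividing the delayed count by the total for each month would give a delay rate, which separates busy months from months that are genuinely more delay-prone.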


Figure 9: Percentage of flights’ departure and arrival delays

We can see that almost 39% of the NYC flights in 2013 had a departure delay, while only 5% departed on time and 55.9% departed ahead of schedule. The breakdown for arrival delays does not differ much from that for departure delays.
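
These percentages follow from sorting each flight into one of three groups by the sign of its delay. The sketch below shows the calculation in Python, assuming that "delayed" means a delay above zero, "on time" exactly zero, and "early" below zero. The function name classify is our own, and the values are the first ten dep_delay entries from the sample output of the dataset.

```python
from collections import Counter


def classify(delay):
    """Label a delay in minutes as 'delayed', 'on time' or 'early'."""
    if delay > 0:
        return "delayed"
    if delay < 0:
        return "early"
    return "on time"


# First ten dep_delay values from the sample output.
dep_delays = [2, 4, 2, -1, -6, -4, -5, -3, -3, -2]

counts = Counter(classify(delay) for delay in dep_delays)
shares = {
    label: 100 * counts[label] / len(dep_delays)
    for label in ("delayed", "on time", "early")
}
print(shares)
```

In this small sample 30% of flights are delayed and 70% leave early; the full dataset gives the shares quoted above.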


carrier  no_flights  low_delay  medium_delay  high_delay  overall_delay
UA       57782       26%        14%           7%          47%
B6       54049       17%        14%           8%          40%
EV       51108       15%        16%           13%         45%
DL       47658       16%        10%           6%          32%
AA       31947       16%        9%            6%          32%
MQ       25037       11%        13%           8%          32%
US       19831       12%        8%            4%          24%
9E       17294       15%        14%           11%         40%
WN       12044       27%        17%           9%          54%
VX       5116        26%        10%           7%          43%
FL       3175        25%        16%           10%         52%
AS       709         18%        7%            6%          32%
F9       681         22%        16%           11%         50%
YV       544         14%        14%           14%         43%
HA       342         13%        4%            3%          20%
OO       29          10%        7%            14%         31%

Table 3: Carriers, their number of flights and the percentage of flights with delays

Carrier WN has the highest percentage of overall delays (54%), while carrier US has a low percentage of delays (24%) relative to its number of flights (19,831).
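
The report does not state the cut-offs that separate low, medium and high delays. The following Python sketch shows how a table of this shape could be built, using hypothetical thresholds (up to 15 minutes is low, up to 60 minutes is medium, anything longer is high) and made-up flights. The names LOW_MAX, MEDIUM_MAX, delay_level and carrier_table are our own.

```python
from collections import Counter, defaultdict

LOW_MAX = 15     # hypothetical cut-off in minutes, not taken from the report
MEDIUM_MAX = 60  # hypothetical cut-off in minutes, not taken from the report


def delay_level(delay):
    """Bin a delay in minutes into 'none', 'low', 'medium' or 'high'."""
    if delay <= 0:
        return "none"
    if delay <= LOW_MAX:
        return "low"
    if delay <= MEDIUM_MAX:
        return "medium"
    return "high"


def carrier_table(flights):
    """Return per-carrier flight counts and rounded delay percentages."""
    levels = defaultdict(Counter)
    for carrier, delay in flights:
        levels[carrier][delay_level(delay)] += 1
    table = {}
    for carrier, counts in levels.items():
        n = sum(counts.values())
        row = {"no_flights": n}
        for level in ("low", "medium", "high"):
            row[f"{level}_delay"] = round(100 * counts[level] / n)
        # Computed from counts rather than by adding the rounded columns.
        row["overall_delay"] = round(100 * (n - counts["none"]) / n)
        table[carrier] = row
    return table


# Hypothetical (carrier, dep_delay) pairs, made up for illustration.
flights = [("UA", 5), ("UA", -2), ("UA", 75), ("UA", 20), ("WN", 10), ("WN", 90)]
table = carrier_table(flights)
print(table)
```

Computing the overall percentage from the raw counts, rather than summing three rounded columns, may be why some rows of Table 3 (B6, for example) do not add up exactly.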



Figure 10: Carriers, their number of flights and the percentage of flights with delays

From the horizontal stacked bar chart, it is clear that carriers UA, EV, B6 and DL have the highest frequency of delays. These carriers also have the highest number of flights, which suggests that carriers with a high number of flights tend to have a high frequency of delays.


carrier  highest_delay  avg_delay
F9       853            20.201175
EV       548            19.838929
YV       387            18.898897
FL       602            18.605984
WN       471            17.661657
9E       747            16.439574
B6       502            12.967548
VX       653            12.756646
OO       154            12.586207
UA       483            12.016908
MQ       1137           10.445381
DL       960            9.223950
AA       1014           8.569130
AS       225            5.830748
HA       1301           4.900585
US       500            3.744693

Table 4: Carriers and their maximum and average delays, in minutes

The carrier with the longest single delay is HA, with a delay of 1301 minutes, and the carrier with the highest average delay is F9, with an average of about 20.2 minutes.
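
The longest delay of 1301 minutes matches the maximum dep_delay in the summary output, so this table presumably summarises departure delays. Under that assumption, the per-carrier maximum and mean can be sketched in Python as follows, using the first nine carrier and dep_delay values from the sample output of the dataset.

```python
from collections import defaultdict
from statistics import mean

# First nine sample flights: carrier code and departure delay in minutes.
carriers = ["UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"]
dep_delays = [2, 4, 2, -1, -6, -4, -5, -3, -3]

# Group the delays by carrier.
by_carrier = defaultdict(list)
for carrier, delay in zip(carriers, dep_delays):
    by_carrier[carrier].append(delay)

# One (carrier, highest_delay, avg_delay) row per carrier,
# sorted by average delay from highest to lowest.
summary = sorted(
    ((carrier, max(delays), mean(delays)) for carrier, delays in by_carrier.items()),
    key=lambda row: row[2],
    reverse=True,
)

for carrier, highest, average in summary:
    print(f"{carrier}  {highest:4d}  {average:.2f}")
```

Sorting by the average while also keeping the maximum shows how the two measures can disagree: a carrier can have one extreme delay and still a low average, as HA does in the table above.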


Conclusion

Resources

Source Code

This report is hosted on GitHub Pages, and the repo can be accessed via this link.